Abstract

We conducted genome sequencing of the filamentous fungus Aspergillus sojae NBRC4239 isolated from 
the koji used to prepare Japanese soy sauce. We used the 454 pyrosequencing technology and investigated 
the genome with respect to enzymes and secondary metabolites in comparison with other Aspergilli 
sequenced. Assembly of 454 reads generated a non-redundant sequence of 39.5-Mb possessing 13 033 
putative genes and 65 scaffolds composed of 557 contigs. Of the 2847 open reading frames with Pfam 
domain scores of >150 found in A. sojae NBRC4239, 81.7% had a high degree of similarity with the 
genes of A. oryzae. Comparative analysis identified serine carboxypeptidase and aspartic protease genes 
unique to A. sojae NBRC4239. While A. oryzae possessed three copies of the α-amylase gene, A. sojae 
NBRC4239 possessed only a single copy. Comparison of 56 gene clusters for secondary metabolites 
between A. sojae NBRC4239 and A. oryzae revealed that 24 clusters were conserved, whereas 32 clusters 
differed between them. The differences included a deletion of 18 508 bp containing mfs1, mao1, dmaT, and pks-nrps 
for cyclopiazonic acid (CPA) biosynthesis, which explains the lack of CPA production in A. sojae. The A. 
sojae NBRC4239 genome data will be useful to characterize functional features of the koji moulds used 
in Japanese industries. 
1. Introduction 

Koji moulds are widely used in the production of 
traditional fermented foods and beverages such as 
Japanese miso, soy sauce, and sake. Two typical koji 
moulds, Aspergillus sojae and A. oryzae, are used. 
During the fermentation process, koji moulds act by 
breaking down the ingredients. Each species of koji 
mould reacts differently to the ingredients used 
and must therefore be selected based on the desired 
product. For example, A. sojae is selected to produce 
miso and soy sauce due to its high proteolytic ability, 
and A. oryzae is used widely in sake, miso, and soy 
sauce production for its high amylolytic ability. 
Among Aspergillus strains deposited in the RIKEN 
Bioresource Center Japan Collection of 
Microorganisms (http://www.jcm.riken.jp/JCM/JCM_DB.shtm), 
15 of the 53 A. oryzae 
strains were derived from sake koji, 6 from miso, and 



© The Author 2011. Published by Oxford University Press on behalf of Kazusa DNA Research Institute. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License 
(http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, 
provided the original work is properly cited. 






Draft Genome Sequencing of Aspergillus sojae 



[Vol. 18 



17 from soy sauce koji, while 16 of the 20 
strains of A. sojae were derived from soy sauce koji. 

For taxonomic classification of Aspergillus species, 
molecular strategies have been developed to discriminate 
several Aspergillus species. 1 Aspergillus oryzae and 
A. sojae are classified in Aspergillus section Flavi, 
which also includes the plant pathogens A. flavus and A. 
parasiticus, which produce aflatoxins known to be 
carcinogenic. An analysis based on restriction-site 
polymorphisms of genes coding for 11 proteins 
and the sequences of five of those genes suggested that 
A. oryzae is a species derived from A. flavus through 
human handling. 2 Given the close evolutionary 
relationship of A. oryzae and A. sojae with pathogenic 
Aspergillus species, it is vital to distinguish between 
aflatoxin-producing and non-producing moulds and 
to select the latter for industrial use. Expressed 
sequence tag (EST) analysis of A. oryzae RIB40, in 
which many of the aflatoxin biosynthesis cluster genes 
were found to be unexpressed, indicated that A. oryzae 
does not produce these substances. 3 For A. sojae, a 
termination point mutation in aflR, which controls 
transcription of the aflatoxin biosynthesis gene 
cluster, and lack of the polyketide synthase (PKS) 
gene are correlated with its inability to produce 
aflatoxin. 4 

The genome of A. oryzae RIB40 was recently completely 
sequenced, and comprehensive analysis 
showed that this strain possesses 134 protease 
genes, including many paralogous genes, and multiple 
copies of α-amylase and α-glucosidase genes. 5 These 
genetic features may account for the high proteolytic 
and amylolytic abilities of A. oryzae RIB40. 

Many of the moulds classified in Aspergillus section 
Flavi are known to produce various secondary metabolites. 
The A. oryzae RIB40 genome encodes genes for 
numerous secondary metabolites other than aflatoxins, 6 
although EST and microarray analysis of A. 
oryzae RIB40 suggested that it produces almost no 
secondary metabolites. 7 These features 
of A. oryzae in terms of quality, productivity, and 
safety may be one of the reasons why A. oryzae 
strains have gradually been selected as industrially 
useful strains. 8 

Although A. sojae has not been studied as extensively 
as A. oryzae, it is likewise thought to be 
a domesticated strain selectively bred from natural 
strains. 

However, the genetic information available for A. sojae 
is insufficient to investigate the functional features 
important for its industrial use, including protease 
and amylase activities as well as safety. Therefore, we 
conducted whole-genome sequencing of the practical 
strain A. sojae NBRC4239, isolated from Japanese 
soy sauce koji, using the next-generation 454 
pyrosequencer. The genetic information of 
A. sojae NBRC4239, combined with that of A. oryzae, 
will deepen our understanding of the biological nature 
of industrially important koji moulds and support 
their further development in the field of food science. 

2. Materials and methods 

2.1. Strain and DNA preparation 

Aspergillus sojae NBRC4239 was obtained from 
NBRC (http://www.nbrc.nite.go.jp/). This strain is a 
practical strain isolated from Japanese soy sauce koji. 

Aspergillus sojae NBRC4239 was incubated in PD 
liquid medium (1% peptone, 2% dextrin, 0.5% KH2PO4, 
0.1% NaNO3, 0.05% MgSO4, and 0.1% casamino 
acids, pH 6.0) on a shaker at 150 rpm at 30°C for 
24 h. After collection in a mortar, the mould was frozen 
in liquid nitrogen and then crushed with a pestle. 
Genomic DNA was extracted from this mould using a 
Wizard Genomic DNA Purification Kit (Promega 
Corporation, USA) and purified using a DNeasy 
Blood & Tissue Kit (QIAGEN Sciences, USA), according 
to the respective manufacturer's protocols. 

2.2. Genome sequencing and data assembly 

For GS FLX Titanium fragment sequencing, 500 ng 
of genomic DNA was sheared into DNA fragments 
ranging from 300 to 800 bp by nebulization. After 
both ends of the DNA fragments were repaired and 
phosphorylated, two types of adaptors (A and B) 
were ligated to the DNA fragments. Next, the DNA 
fragments carrying the 5'-biotin of adaptor B from 
the ligation mixture were immobilized onto magnetic 
streptavidin-coated beads. The single-stranded tem- 
plate DNA (ssDNA) molecules carrying Adaptor A at 
5'-end and Adaptor B at 3'-end were isolated by alka- 
line denaturation. These purified ssDNAs were then 
hybridized to DNA capture beads and clonally ampli- 
fied by an emulsion polymerase chain reaction (PCR) 
method. After denaturation of the amplified double- 
stranded DNAs on the capture beads, these beads 
with single-stranded molecules were spread onto 
each well of a picotitre plate. For GS FLX Titanium 
paired-end sequencing, 15 µg of genomic DNA was 
sheared into DNA fragments of ~8 kb by 
fragmentation. After the ends of the DNA fragments 
were repaired and internal adaptors were ligated to 
the DNA fragments, each DNA fragment was circular- 
ized and self-ligated. The circular DNA was sheared 
into DNA fragments ranging from 300 to 800 bp by 
nebulization. The DNA fragments carrying the 5'- 
biotin of internal adaptor from the ligation mixture 
were immobilized onto magnetic streptavidin-coated 
beads. Paired-end sequencing was then carried out in 
the same way as described above. Two sequencing 
runs in total were carried out. The GS FLX sequence 
data were assembled using the Newbler assembly 
software. 

2.3. Gene prediction and annotation 

GlimmerHMM, 9 AUGUSTUS, 10,11 SNAP, 12 and 
GeneMark + ES 13 were used as ab initio predictors, 
and Genewise 14 was used as the evidence-based 
predictor. The ab initio GlimmerHMM, AUGUSTUS, and 
SNAP parameters were trained on all A. oryzae RIB40 
gene models, while GeneMark + ES performed an 
iterative self-training procedure. The amino acid 
sequences of A. oryzae RIB40 were used for alignment 
by Genewise. GFF files obtained from these prediction 
programmes were integrated by Evigan 15 to 
produce the final prediction results. Of the predicted 
open reading frames (ORFs), those with more than 
100 amino acid residues were selected as predicted 
genes. 

Amino acid sequences of the predicted genes were 
matched against the non-redundant protein database (nr, 
NCBI) by BLAST 16 and were annotated based on 
identity. tRNAs were predicted by tRNAscan-SE. 17 

2.4. Comparative genomics 

Nucleotide and amino acid sequences of A. oryzae 
RIB40 were obtained from DOGAN 
(http://www.nbrc.nite.go.jp/dogan/). Nucleotide and amino acid 
sequences of A. flavus NRRL3357, A. fumigatus 
Af293, and A. nidulans FGSC A4 were obtained from 
the Aspergillus Genome Database (AspGD). 18 
Putative domains were predicted with the HMMER 19 
programme hmmscan using the hidden Markov 
models from the Pfam database. 20 For A. oryzae 
RIB40, A. flavus NRRL3357, A. fumigatus Af293, and 
A. nidulans FGSC A4, ORFs with Pfam domain scores 
of >150 were subjected to comparison. 
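
The score cutoff described above can be sketched as a simple filter. The following is a minimal illustration only, assuming the hmmscan hits have already been parsed into (ORF identifier, Pfam domain, score) tuples; the identifiers, domain names, and scores are made-up examples, not data from this study.

```python
from collections import defaultdict

SCORE_CUTOFF = 150.0  # threshold stated in the text (domain score > 150)


def orfs_above_cutoff(hits, cutoff=SCORE_CUTOFF):
    """Return {orf_id: [domains]} for hits scoring strictly above the cutoff."""
    kept = defaultdict(list)
    for orf_id, domain, score in hits:
        if score > cutoff:
            kept[orf_id].append(domain)
    return dict(kept)


# Hypothetical parsed hits (illustrative only)
example_hits = [
    ("orf_0001", "Peptidase_S10", 312.4),
    ("orf_0002", "Asp", 98.7),
    ("orf_0003", "Alpha-amylase", 150.0),  # exactly 150 is not > 150
    ("orf_0003", "Alpha-amylase", 201.3),
]
selected = orfs_above_cutoff(example_hits)
```

A strict "greater than" comparison is used here because the text states the threshold as >150; a different reading of the cutoff would change which borderline hits are retained.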

2.4.1. Protease 

A domain list for proteases was 
created by matching amino acid sequences registered 
in MEROPS 21 against the Pfam database 
with HMMER. Based on this list, amino acid sequences with 
protease domain scores >150 were compared 
between A. sojae NBRC4239 and A. oryzae RIB40. 
For phylogenetic analysis, multiple alignments were 
carried out by ClustalX 22 and phylogenetic trees were 
drawn with TreeView. 23 

2.4.2. Amylolytic enzymes 

Among the glycoside hydrolases, families 13, 15, and 
31 are known to be involved in amylolysis. Entries for 
glycoside hydrolases of A. oryzae RIB40 belonging to 
these three families were extracted from the 
Carbohydrate Active Enzyme (CAZy) database. 24 
From these data, the amylolytic enzymes possessed 
by A. sojae NBRC4239 were predicted. 
Furthermore, these predicted amylolytic 
enzymes were checked for the presence 
of active centre residues (data not shown). 
Alignments of nucleotide and amino acid sequences 
were carried out by GENETYX, and BLAST was used 
for homology searches. PCR primer sequences used 
for partial nucleotide sequence checks are shown in 
Supplementary Table S1. 

2.4.3. Secondary metabolism 

Sequences of secondary metabolite gene clusters in 
A. oryzae RIB40 were obtained from the Secondary 
Metabolite Unique Regions Finder (SMURF) database. 25 
Cyclopiazonic acid (CPA) biosynthesis gene cluster 
sequences of A. flavus were obtained from the Broad 
Institute (http://www.broadinstitute.org/annotation/genome/Aspergillus_group) 
with reference to the report by Chang et al. 26 
Sequences near the aflatrem 
gene clusters in A. flavus and A. oryzae were obtained 
following Nicholson et al., 27 with the 
sequences for A. flavus NRRL3357 and A. oryzae 
RIB40 taken from the Broad Institute and 
DOGAN, respectively. For A. flavus NRRL6541, 
the sequences ATM1 (AY559849.2 GI:161621808) and 
ATM2 (AM921700.1 GI:162286818) deposited in 
GenBank (http://www.ncbi.nlm.nih.gov/genbank) 
were used. Harr plots were generated by In silico 
Molecular Cloning Series IMC, genomics edition (In 
Silico Biology, Inc.). PCR primer sequences used for 
partial nucleotide sequence checks are shown in 
Supplementary Table S2. 



2.5. Accession numbers 

Nucleotide sequence data were deposited in the 
DDBJ/EMBL/GenBank DNA databases. Accession 
numbers for the 65 scaffold sequences are 
DF093557-DF093585 and those for the 1034 contig 
sequences are BACA01000001-BACA01001034. 



3. Results and discussion 

3.1. Sequencing and assembly 

We obtained 1034 contigs (>100 bp) and 65 scaffolds 
by assembling the reads obtained from sequencing 
(Supplementary Table S3). The 65 scaffolds are 
composed of 557 contigs; thus, 477 contigs were not 
incorporated into scaffolds. Of the 1034 contigs, 707 
were >500 bp. The total lengths of the contigs and of 
the scaffolds each exceeded 39 Mb. As the genome size 
reported for A. oryzae RIB40 is 37.6 Mb, 5 the 
genome of A. sojae NBRC4239 is predicted to 
be larger than that of A. oryzae RIB40. 
3.2. Gene prediction and annotation 

ORF prediction was carried out on the 65 scaffolds 
and the 477 contigs that were not incorporated into scaffolds. 
As a result, we obtained 13 033 ORFs of >100 amino 
acid residues. In addition, 275 tRNAs were predicted 
by tRNAscan-SE. Table 1 shows the comparison of 
ORFs of A. oryzae RIB40 and A. sojae NBRC4239. 

3.3. Comparative genomics 

The proportion of ORFs with Pfam domain scores of 
>150 was 2847/13 033 (21.8%) for A. sojae 
NBRC4239, 2868/12 074 (23.8%) for A. oryzae 
RIB40, 2880/12 604 (22.9%) for A. flavus 
NRRL3357, 2517/9887 (25.5%) for A. fumigatus 
Af293, and 2641/11 272 (23.4%) for A. nidulans 
FGSC A4. 
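
The proportions above follow directly from the counts quoted in the text. A minimal sketch of the arithmetic, using only those counts (the dictionary layout and the function name are illustrative choices); recomputed values may differ from the reported percentages in the last digit because of rounding.

```python
# (ORFs with Pfam domain score > 150, total predicted ORFs), as quoted in the text
counts = {
    "A. sojae NBRC4239": (2847, 13033),
    "A. oryzae RIB40": (2868, 12074),
    "A. flavus NRRL3357": (2880, 12604),
    "A. fumigatus Af293": (2517, 9887),
    "A. nidulans FGSC A4": (2641, 11272),
}


def percent(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


proportions = {strain: percent(hits, total) for strain, (hits, total) in counts.items()}

for strain, value in proportions.items():
    print(f"{strain}: {value}%")
```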

ORFs with Pfam domain scores of >150 were compared 
between A. sojae NBRC4239 and A. oryzae RIB40 
by BLASTP, and 2326/2847 (81.7%) 
had >90% identity (Supplementary Fig. S1a). Next, 
397/2847 ORFs from A. sojae NBRC4239 were 
found to have <70% identity with those from A. 
oryzae RIB40. Of those, 134 ORFs had >90% identity 
and 192 ORFs had <70% identity at the nucleotide 
sequence level to the corresponding regions of A. oryzae 
RIB40 by tBLASTN. Therefore, the 134 ORFs found in 
A. sojae NBRC4239 may have been missed in the 
gene prediction of A. oryzae RIB40, and orthologues 
of the 192 ORFs may be absent from A. 
oryzae RIB40. These 192 ORFs in A. sojae NBRC4239 
were matched against A. flavus NRRL3357 genes with 
domain scores >150 using BLASTP. As a result, 22 
had >90% identity and 170 had <70% identity. 
These results indicate that the 22 ORFs are present 
in both A. sojae NBRC4239 and A. flavus NRRL3357 
but absent from A. oryzae RIB40. The 170 ORFs with 
<70% identity were matched against the nr database of NCBI 
by BLASTP, and 132 ORFs were found to have 



Table 1. Comparison of ORFs of A. oryzae RIB40 and A. sojae NBRC4239 

                         A. oryzae RIB40    A. sojae NBRC4239 
Size of assembly (Mb)    37.6               39.5 
GC content (%)           48.2               48.1 
tRNA genes               270                275 
Number of ORFs           12 074             13 033 
Average ORF size         449.8              455.9 
Min ORF size             101                101 
Max ORF size             6886               7566 

The sequencing methods and the ORF prediction procedures 
differed between A. oryzae and A. sojae. The minimum size 
of 101 simply reflects the artificial cutoff value. 



<70% identity. Thus, these 132 ORFs may be unique 
to A. sojae NBRC4239 (Supplementary Fig. S1b). 
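
The identity thresholds used in this comparison can be sketched as a simple binning step. The following is an illustration only, assuming best-hit percentage identities have already been extracted from the BLAST output; the ORF names and identity values are hypothetical and are not data from this study.

```python
from collections import Counter


def identity_bin(percent_identity):
    """Assign a best-hit identity (%) to the bins used in the text."""
    if percent_identity > 90:
        return ">90%"
    if percent_identity < 70:
        return "<70%"
    return "70-90%"


# Hypothetical best-hit identities (illustrative only)
best_hits = {"orf_a": 98.2, "orf_b": 56.0, "orf_c": 84.5, "orf_d": 33.0}

bin_counts = Counter(identity_bin(value) for value in best_hits.values())

# ORFs in the <70% bin are the candidates carried forward to the next search
low_identity_orfs = sorted(
    name for name, value in best_hits.items() if identity_bin(value) == "<70%"
)
```

In the workflow described above, the low-identity set from one comparison becomes the query set for the next database, which is why the candidate list is kept separately from the counts.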

3.3.1. Protease 

Of the 2847 ORFs of A. sojae 
NBRC4239 and 2868 of A. oryzae RIB40, 83 ORFs in 
A. oryzae RIB40 and 76 in A. sojae NBRC4239 had 
protease domain scores >150; thus, A. sojae 
NBRC4239 had seven fewer ORFs. The total number 
of predicted protease genes in A. oryzae RIB40 was 
considerably lower than the reported 134, 5 and this 
difference is likely due to the strict domain score 
cutoff of >150. 

ORFs with domain scores >150 from A. sojae 
NBRC4239 and A. oryzae RIB40 were sorted and compared 
by domain. The number of proteases with each 
specific domain did not differ greatly between the two 
species. For four types of protease, A. sojae 
NBRC4239 had one more gene than A. 
oryzae RIB40. For eleven types of protease, 
A. sojae NBRC4239 had one gene fewer than 
A. oryzae RIB40. In both A. oryzae 
RIB40 and A. sojae NBRC4239, serine carboxypeptidases 
were most abundant, followed by aspartic proteases. 
Aspergillus sojae NBRC4239 had 13 serine 
carboxypeptidases, one more than A. 
oryzae RIB40, and seven aspartic proteases, 
two fewer than A. oryzae RIB40. 

Under the strict condition of protease domain 
scores >150, we found no significant difference in 
the number of protease genes between the two 
species. However, by using a less strict condition, a 
difference in the number of protease genes may be 
observed between A. sojae and A. oryzae. 



3.3.1.1. Serine carboxypeptidase 

A phylogenetic tree of serine carboxypeptidases based 
on sequences with domain scores >150 was 
constructed for A. sojae NBRC4239, 
A. oryzae RIB40, A. flavus NRRL3357, A. fumigatus 
Af293, and A. nidulans FGSC A4 (Fig. 1). We found 
that A. sojae NBRC4239 possesses a serine 
carboxypeptidase gene (scaffold00048.369) that has low 
sequence similarity with those of the other four species. This 
gene was also included in the 132 ORFs that had 
<70% identity against nr (see the 'Comparative 
genomics' section). A BLASTP search of 
scaffold00048.369 against the ORFs of A. oryzae RIB40 
identified AO090103000026 as the 
closest match, with 56% identity. 
AO090103000026 was annotated as a serine 
carboxypeptidase in A. oryzae RIB40 (Fig. 1). In addition, 
scaffold00048.369 was found to have the highest 
similarity to carboxypeptidase S1 of Neosartorya 
fischeri NRRL181, with 58% identity, by searching 
Figure 1. Phylogenetic analysis of serine carboxypeptidases in five Aspergillus species. ORFs of serine carboxypeptidases with Pfam domain 
scores of >150 were aligned by ClustalW and the tree was drawn by TreeView for five Aspergillus species. Filled circles, A. sojae; filled 
squares, A. oryzae; filled diamonds, A. flavus; triangle, A. nidulans; inverted triangles, A. fumigatus. Blue lines indicate the genes 
common to all five Aspergillus species. Green lines indicate the genes common to the three species in Aspergillus section Flavi. 
Red shows the gene unique to A. sojae. Orange shows extra homologues in A. oryzae reported in Machida et al. 6 



against the nr NCBI database by BLASTP. From these 
results, scaffold00048.369 is likely to be a serine 
carboxypeptidase unique to A. sojae NBRC4239. We 
confirmed the expression of scaffold00048.369 by 
RT-PCR of mRNA isolated from A. sojae NBRC4239 
grown in wheat bran medium. Sequencing of 
the RT-PCR product revealed a sequence identical 
to the predicted ORF, indicating that our ORF 
prediction for scaffold00048.369 was correct and that this gene 
is expressed in wheat bran medium (data not shown). 

Phylogenetic analysis showed that serine carboxypeptidase 
genes are classified into five clusters 
common to the five Aspergillus species (C1-5) and 
seven clusters common only to the three 
species of Aspergillus section Flavi (F1-7). Three of 
the five common clusters contained serine 
carboxypeptidase genes unique to Aspergillus 
section Flavi, in addition to the putative orthologous 
genes. The serine carboxypeptidases previously 
reported as unique to A. oryzae RIB40 6 are therefore 
considered to be common to Aspergillus 
section Flavi. It is conceivable that A. sojae and A. 
oryzae have been used widely in miso and soy sauce 
fermentation because they possess highly 
similar protease genes. 



3.3.1.2. Aspartic protease 

A phylogenetic tree of aspartic proteases based on 
sequences with domain scores of >150 was 
constructed for A. sojae NBRC4239, 
A. oryzae RIB40, A. flavus NRRL3357, A. fumigatus 
Af293, and A. nidulans FGSC A4 (Fig. 2). We found 
Figure 2. Phylogenetic analysis of aspartic proteases in five Aspergillus species. ORFs of aspartic proteases with Pfam domain scores of >150 
were aligned by ClustalW and the tree was drawn by TreeView for five Aspergillus species. Filled circles, A. sojae; filled squares, A. oryzae; filled 
diamonds, A. flavus; triangle, A. nidulans; inverted triangles, A. fumigatus. Blue lines indicate the genes common to all five 
Aspergillus species. Green line indicates the genes common to the three species in Aspergillus section Flavi. Red shows the gene 
unique to A. sojae. Orange shows extra homologues in A. oryzae reported in Machida et al. 6 



that A. sojae NBRC4239 possesses an aspartic 
protease gene (scaffold00063.1451) that showed 
low sequence similarity with those in the other 
four species. This gene was also included in the 
132 ORFs that had <70% identity against nr 
(see the 'Comparative genomics' section). 
Scaffold00063.1451 was searched against the 
ORFs of A. oryzae RIB40 by BLASTP, and 
AO090701000002 was found to be the closest 
gene, with 33% identity. AO090701000002 was 
annotated as an aspartic protease in A. oryzae RIB40 
(Fig. 2). A BLASTP search of scaffold00063.1451 
against the nr NCBI database identified 
yapsin of Penicillium marneffei ATCC18224 
as the most similar sequence, with 46% identity. 
From these results, 
scaffold00063.1451 is likely to be an aspartic 
protease unique to A. sojae NBRC4239. We confirmed 
the expression of scaffold00063.1451 by RT-PCR 
followed by sequencing, as described in the 'Serine 
carboxypeptidase' section, indicating that our ORF 
prediction for scaffold00063.1451 was correct and that 
this gene is expressed in wheat bran medium (data 
not shown). 

Phylogenetic analysis showed that aspartic protease 
genes are classified into four clusters 
common to the five Aspergillus species (C1-4). 
Only one cluster was common to only the 
three species belonging to Aspergillus section Flavi 
(F1), and three other clusters were shared by two 
of the three species. These data suggest that 
aspartic protease genes are less conserved among 
Aspergillus section Flavi than serine 
carboxypeptidase genes. 

3.3.2. Amylolytic enzymes 

3.3.2.1. Amylolytic enzymes in A. sojae NBRC4239 and A. oryzae RIB40 

A. sojae is generally known to have lower amylolytic 
activity than A. oryzae. We studied the 
genes for amylolytic enzymes in A. oryzae RIB40 and 
A. sojae NBRC4239 to analyse this difference. First, 
we compared the number of glycoside hydrolases 
belonging to Families 13, 15, and 31 in both strains 
(Supplementary Table S4). We found no 
difference in the number of glycoside hydrolase genes 
between A. sojae and A. oryzae for Family 31, which includes 
α-glucosidase (EC 3.2.1.20), or for Family 15, which 
includes glucoamylase (EC 3.2.1.3). Thus, there is unlikely 
to be a difference in these enzymatic activities 
between the two Aspergillus strains. In contrast, we 
found that A. sojae has two fewer glycoside 
hydrolases in Family 13, which includes α-amylase 
(EC 3.2.1.1), than A. oryzae. The absence of the 
two genes in A. sojae was due to copy number 
variation between the two strains; A. sojae has only 
one copy of amyB compared with the three copies 
(AO090023000944: amy1, AO090120000196: 
amy2, and AO090003001210: amy3) in A. 
oryzae. 28 In A. oryzae, amyB codes for so-called 
Taka-amylase, which is important for amylolysis. 
Therefore, the lower copy number of amyB orthologues 
in A. sojae likely accounts for its lower amylolytic 
ability compared with A. oryzae. 

3.3.2.2. α-Amylase genes and their flanking regions 

The above-mentioned three α-amylase genes and 
their flanking 20-kb regions of A. oryzae were 
further compared with the corresponding regions in 
A. sojae NBRC4239 scaffolds. We investigated 
whether the difference in α-amylase gene copy 
numbers results from a difference in genomic 
structure. The results are shown in Fig. 3. In the A. sojae 
amy1 region, a 12.5-kb sequence including the 2.2-kb 
amy1 ORF and its promoter and terminator 
regions was absent. In place of the 12.5-kb region, a 
unique 2.9-kb sequence lacking amy1 was 
present. This was also the case for amy2, where a 
12.4-kb region including the amy2 ORF and its promoter 
and terminator regions was absent, but no unique 
sequence such as that observed in the amy1 region was 
present. Furthermore, a 7.2-kb region was 
also absent and an inverted region was observed 
Figure 3. Comparison of α-amylase genes and their flanking 
regions. Map of α-amylase and flanking regions. Grey box, 
conserved regions; shaded box, regions unique to A. sojae; 
white box, regions unique to A. oryzae; black box, inverted 
regions; black arrow, α-amylase gene ORF; black triangle, PCR 
primer-binding site. 

near the missing amy2 region in A. sojae. These 
results indicate that this region of the A. sojae 
genome was structurally rearranged. In the amy3 
region, we found that the amy3 structural gene and its 
terminator region were conserved between A. 
oryzae and A. sojae. However, a 1.9-kb insertion 
sequence was found 0.53 kb upstream of the 
translation initiation site in the A. oryzae amy3 promoter. 

We confirmed by PCR that the genomic 
structures of the α-amylase regions predicted from 
genome analysis were correct for this strain 
(Supplementary Fig. S2). These results indicate that 
the difference in copy numbers of α-amylase genes 
between A. sojae NBRC4239 and A. oryzae RIB40 is 
a result of a rearrangement in genomic structure. 

3.3.2.3. Analysis of transposons surrounding the α-amylase genes 

We investigated transposons near the 
α-amylase genes in A. oryzae and A. sojae (Fig. 4A). We 
found that the ~1.9-kb insertion sequence located 
upstream in the A. oryzae amy3 promoter is the 
transposon Tao1 (DDBJ/EMBL/GenBank accession number: 
AB021710.1). This transposon was flanked by 
inverted repeat sequences characteristic of Class II 
DNA transposons (Fig. 4A). 29 The Tao1 insertion site 
in the A. oryzae amy3 promoter region was found to 
correspond to the 'TA' sequence at positions -533 to 
-532 in the A. sojae amy3 promoter. 
This is consistent with the tendency of Class II transposons 
to integrate preferentially at a TA sequence, 
resulting in duplication of the TA target site on both flanks of 
the integrated transposon. 29 Tao1 is located 
further upstream than the recognition site of the amylase 
transcription factor AmyR 30 and the CreA 
recognition sites 31 involved in carbon catabolite repression 
(Fig. 4B). It is not clear whether the Tao1 insertion affects 
the expression of the A. oryzae amy3 gene, and 
further study will be needed to clarify its effect. 
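
The TA target-site duplication described above can be illustrated with a toy model. The following sketch assumes a simplified picture in which an element integrates at a TA dinucleotide and leaves one TA copy on each flank; the sequences and function names are invented for illustration and do not represent the actual transposon or promoter sequences.

```python
def ta_sites(seq):
    """Return 0-based start positions of every 'TA' dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "TA"]


def insert_with_ta_duplication(target, site, element):
    """Insert `element` at a TA site so that a TA copy flanks it on both sides."""
    if target[site:site + 2].upper() != "TA":
        raise ValueError("insertion site is not a TA dinucleotide")
    return target[:site] + "TA" + element + "TA" + target[site + 2:]


# Toy target sequence and element (lower case marks the inserted element)
promoter = "GGCTACCG"
sites = ta_sites(promoter)
inserted = insert_with_ta_duplication(promoter, sites[0], "nnnn")
print(inserted)
```

Comparing an uninterrupted sequence with an interrupted one in this way is the same logic used above to map the insertion site in one species onto the single TA in the other.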

The Tao1 transposon was also present within the 
promoters of amy1 and amy2 of A. oryzae. The 
insertion sites were identical to that of the amy3 insertion (Fig. 4A). 
However, the Tao1 copies inserted in the amy1 and amy2 
promoters were largely truncated and only a partial 
sequence of 575 bp at the 3'-end of the total 
1.9-kb sequence was left. The truncation of Tao1 in 
amy1 and amy2 appears to have occurred after triplication of 
a single amy gene carrying the Tao1 transposon, which 
was inserted in the A. oryzae lineage after divergence of 
A. sojae from the common ancestor. 

We also found a 981-bp ORF similar to the Ant1 
transposase near amy1 and amy2 of A. oryzae. The Ant1 
transposon is a Class II DNA transposon 
classified in the Tc1/mariner group, 29,32 and Ant1 in 
A. niger is reported to have transposition activity. 33 
Corresponding Ant1 transposase homologues were 
not found in the downstream regions of amy3 of 
A. oryzae or amyA of A. sojae. 

As shown above, several transposons, such as Tao1 
and the Ant1 transposase homologue, were found near 
the three α-amylase genes of A. oryzae, but were 
absent around amyA of A. sojae. The mechanism for 
the multiplication of the α-amylase gene in A. oryzae is 
unclear, but these transposons might have played a crucial 
role in the amylase gene multiplication in A. oryzae. 
The difference in α-amylase gene copy numbers 
might result in the difference in amylolytic activity 
between the two strains. This is likely to be a major 
reason why A. oryzae became widely used in industry, 
such as in fermentation of sake, soy sauce, and 
miso, whereas A. sojae came to be used solely for soy 
sauce fermentation. 



3.3.3. Analysis of secondary metabolism-related genes 

3.3.3.1. Comparison with secondary metabolite gene clusters in A. oryzae 

Fifty-six secondary metabolite cluster sequences 
were predicted using SMURF in the A. oryzae RIB40 
genome. 25 We analysed these clusters in the A. 
sojae NBRC4239 genome by Harr plots. Of the 
56 predicted secondary metabolite clusters, 24 
clusters were found to be almost identical and the 
remaining 32 clusters differed from those 
in A. oryzae RIB40 (Supplementary Table S5). 
Aspergillus sojae NBRC4239 had no sequence 
homologous to Cluster 51, located at the end of 
chromosome 5 in A. oryzae. In addition, large portions of 
Clusters 38 (non-ribosomal peptide synthetase: 
NRPS) and 47 (NRPS) were missing in A. sojae 
NBRC4239 (Fig. 5). For Cluster 38, A. sojae 
NBRC4239 had a replaced 363-bp sequence and a 
sequence containing a 5' portion of 460 bp and 
a 3' portion of 2.9 kb, generated by a deletion of 
Figure 4. Analysis of transposons surrounding α-amylase genes in A. oryzae RIB40 and A. sojae NBRC4239. (A) Expected insertion site of the A. 
oryzae Tao1 transposon. (B) Map of transposons surrounding A. oryzae and A. sojae α-amylase genes. 

Figure 5. Comparison of A. oryzae predicted secondary metabolite clusters with A. sojae. Secondary metabolite Clusters 38 (A) and 47 (B), 
which were predicted from A. oryzae RIB40 by SMURF, were compared with sequences in A. sojae NBRC4239 by Harr plots. 

Figure 6. Comparison of CPA biosynthesis gene cluster regions between A. sojae NBRC4239, A. flavus NRRL3357, and A. oryzae RIB40. 
Nucleotide sequences of CPA biosynthesis gene cluster regions were compared between A. sojae NBRC4239, A. flavus NRRL3357, 
and A. oryzae RIB40 by BLASTN and are presented diagrammatically. 



23 188 bp from the 26 571 bp (SC026: 1081838-1108408) 
predicted in A. oryzae RIB40 (Fig. 5A). 
For Cluster 47, the 3' portion of 14 kb in the 30 338 bp 
(SC102: 1249352-1279689) predicted in A. 
oryzae RIB40 was replaced by an unrelated 17-kb 
sequence (Fig. 5B). In addition, small deletions and 
insertions were observed in the remaining 29 clusters 
(data not shown). 
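
The cluster sizes quoted above can be checked against the scaffold coordinates. A minimal sketch of that arithmetic, assuming the coordinates are 1-based and inclusive (an assumption; the helper name is illustrative):

```python
def span_length(start, end):
    """Length of a 1-based, inclusive coordinate interval."""
    return end - start + 1


# Coordinates of the A. oryzae RIB40 clusters as quoted in the text
cluster38_bp = span_length(1081838, 1108408)   # Cluster 38 on SC026
cluster47_bp = span_length(1249352, 1279689)   # Cluster 47 on SC102

# Sequence of Cluster 38 that remains after the 23 188-bp deletion
deleted_bp = 23188
remaining38_bp = cluster38_bp - deleted_bp

print(cluster38_bp, cluster47_bp, remaining38_bp)
```

Under this assumption the interval lengths match the 26 571 bp and 30 338 bp given in the text, and the remainder for Cluster 38 is close to the sum of the 460-bp and 2.9-kb flanking portions described above.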

Cluster 51 (PKS) is located near the end of 
chromosome 5 (SC113: 1828103-1841684) in A. oryzae 
RIB40, and the absence of an equivalent cluster in 
A. sojae NBRC4239 may be partly explained by the 
instability of the region near the chromosome end. 
On the other hand, Clusters 38 and 47 of A. oryzae 
RIB40, whose equivalents in A. sojae NBRC4239 carry 
large deletions, are located near the centre of an 
arm of chromosome 3 (SC026: 1081838-1108408 of 
2 324 132 bp) and near the centromere of chromosome 4 
(SC102: 1249352-1279689 of 1 779 707 bp), respectively. 
Therefore, the reasons for the loss of these secondary 
metabolite gene clusters may differ from the reason 
for the absence of the Cluster 51 equivalent at the 
chromosomal end. 
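
The coordinates quoted above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the class name `Region` and its fields are illustrative names introduced here, not part of the original analysis, and the position of a region along a scaffold is not the same thing as its position on the chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A 1-based, inclusive interval on a scaffold (illustrative helper)."""
    scaffold: str
    start: int
    end: int
    scaffold_len: int  # total scaffold length in bp, as quoted in the text

    @property
    def length(self) -> int:
        # Both ends are included, hence the +1.
        return self.end - self.start + 1

    @property
    def relative_midpoint(self) -> float:
        # Midpoint of the region as a fraction of the scaffold length.
        return ((self.start + self.end) / 2) / self.scaffold_len


# Coordinates taken from the text for A. oryzae RIB40.
cluster38 = Region("SC026", 1_081_838, 1_108_408, 2_324_132)
cluster47 = Region("SC102", 1_249_352, 1_279_689, 1_779_707)

for name, region in (("Cluster 38", cluster38), ("Cluster 47", cluster47)):
    print(f"{name}: {region.length} bp on {region.scaffold}, "
          f"midpoint at fraction {region.relative_midpoint:.2f} of the scaffold")
```

The inclusive lengths agree with the 26 571 bp and 30 338 bp given for Clusters 38 and 47.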



3.3.3.2. Analysis of CPA gene cluster regions 

The CPA biosynthesis gene cluster regions identified 
in A. flavus and A. oryzae were compared among the 
genomes of A. sojae NBRC4239, A. flavus NRRL3357, and 
A. oryzae RIB40 by BLASTN. 34 The results are shown 
in Fig. 6. In A. flavus and A. oryzae, the CPA clusters 
were found at the end of chromosome 3, next to the 
aflatoxin biosynthesis gene cluster (Fig. 6). The genes 
mfs1, mao1, dmaT, Pks-nrps, and ctfR1 shown in the 
figure are considered to be involved in CPA biosynthesis. 
Because a large portion of Pks-nrps on the telomere 
side is deleted in A. oryzae RIB40, this strain cannot 
synthesize CPA. 34 On the other hand, 18 508 bp of 
the CPA biosynthesis cluster region was found to be 
deleted in A. sojae NBRC4239, so that most of the 
mfs1, mao1, dmaT, and Pks-nrps sequences were 
missing (Fig. 6). Furthermore, a 7653-bp sequence 
including ord1, ord2, and ord3 lies 25 kb from the 
CPA cluster toward the telomere in A. flavus. In 
A. sojae, this sequence was found inserted in inverted 
orientation next to the missing CPA gene cluster. 
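
As an illustration of how an inverted segment shows up in a BLASTN comparison, the sketch below parses tabular BLAST output and reports the strand of each hit. It assumes the default 12-column tabular format (`-outfmt 6`), which the text does not state was used. The sequence names and coordinates in `sample` are hypothetical and serve only to exercise the code.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int

    @property
    def strand(self) -> str:
        # In BLASTN tabular output a minus-strand hit is reported
        # with the subject start greater than the subject end.
        return "minus" if self.s_start > self.s_end else "plus"


def parse_outfmt6(text: str) -> list[Hit]:
    """Parse default 12-column tabular BLAST output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(f[0], f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hits


# Hypothetical rows: one collinear hit and one inverted 7653-bp hit.
sample = "\n".join([
    "flavus_cpa\tsojae_scaffold\t98.5\t5000\t70\t5\t1\t5000\t10001\t15000\t0.0\t8900",
    "flavus_ord\tsojae_scaffold\t98.9\t7653\t80\t4\t1\t7653\t22653\t15001\t0.0\t13600",
])

hits = parse_outfmt6(sample)
for hit in hits:
    print(hit.query, hit.strand)
```

A hit on the minus strand next to a collinear hit is the kind of signal that would be followed up experimentally, for example by PCR across the junctions.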

PCR was carried out to confirm the deletion of the 
CPA biosynthesis gene cluster region and the inverted 
insertion in A. sojae (Supplementary Fig. S3). The 
results confirmed the 18.5-kb deletion and the inverted 
7.6-kb insertion found in the A. sojae NBRC4239 
genome. 

In addition to the complete deletion of mfs1, mao1, 
and dmaT, the present analysis found that the promoter 
and half of the ORF of Pks-nrps, including the 
ketoacyl synthase (KS) domain and the acyltransferase 
(AT) domain, are deleted in A. sojae. The genes mao1, 
dmaT, and Pks-nrps are essential for CPA biosynthesis 
in A. flavus. 26 Therefore, the present data lead to 
the conclusion that A. sojae is unable to produce CPA, 
which also verifies the safety of A. sojae for use in 
industry. 

3.3.3.3. Analysis of aflatrem biosynthesis gene 
cluster regions 

The A. sojae genome was analysed for the aflatrem 
biosynthesis gene cluster found in A. flavus. 27 
The aflatrem biosynthesis genes in A. flavus are known 
to consist of genes required to synthesize the 
intermediate paspaline (atmG, atmC, atmM, and atmB) as 
well as genes required to convert paspaline to aflatrem 
(atmP, atmQ, and atmD), which are encoded at two 
separate loci, ATM1 and ATM2, in A. flavus, 
respectively. 27 ATM1 (34 816 bp) and ATM2 (25 256 bp) 
were compared with the A. oryzae and A. sojae genomes 
by BLASTN. In A. oryzae, we found sequences almost 
identical to ATM1 and ATM2, with only a few base 
substitutions. In contrast, five gaps of >100 bp were 
found in the region corresponding to the ATM1 locus in 
A. sojae (Fig. 7A). In Gap 4, a 1140-bp sequence in 
A. flavus was replaced by an unrelated 6138-bp sequence 
in A. sojae (Fig. 7A). Nine gaps of >100 bp were 
also observed in the region corresponding to the ATM2 
locus in A. sojae (Fig. 7B). In Gap 7, a 2761-bp 
sequence in A. flavus was replaced by an unrelated 
11 493-bp sequence in A. sojae (Fig. 7B). All of these 
gaps in A. sojae lay in non-coding regions. A frameshift 
caused by a single-base insertion in exon 7 of atmQ was 
reported to account for the non-productivity of 
aflatrem in A. oryzae, 27 but no such mutation was 
found in A. sojae in this study. The presence of the 
insertions in A. sojae NBRC4239 was confirmed by PCR 
(Supplementary Fig. S4). 
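
Gaps of this kind can be listed from alignment coordinates along a reference locus. The sketch below is one possible approach, not the procedure used in this study: it takes the aligned intervals on the reference (here a locus of 34 816 bp, the length given for ATM1) and returns the uncovered stretches longer than 100 bp. The function name `uncovered_gaps` and the interval values are hypothetical.

```python
def uncovered_gaps(intervals, region_len, min_gap=101):
    """Return 1-based inclusive (start, end) stretches of 1..region_len
    that no aligned interval covers and that are at least min_gap bp long."""
    gaps = []
    next_uncovered = 1  # first reference position not yet covered
    for start, end in sorted(intervals):
        if start > next_uncovered:
            gap_len = start - next_uncovered
            if gap_len >= min_gap:
                gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    # Trailing stretch after the last aligned interval, if any.
    if next_uncovered <= region_len and region_len - next_uncovered + 1 >= min_gap:
        gaps.append((next_uncovered, region_len))
    return gaps


# Hypothetical aligned intervals on a 34 816-bp reference locus.
aligned = [(1, 5000), (4800, 12000), (12051, 20000), (21141, 34816)]

# The 50-bp break (12001-12050) is below the threshold and is not reported.
print(uncovered_gaps(aligned, 34816))
```

This only measures what is missing on the reference side. Whether an unrelated sequence has been inserted in place of a gap has to be read from the coordinates of the flanking hits on the other genome.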

In this study, we showed that the ATM1 and ATM2 loci 
of A. sojae NBRC4239 differ in many respects, including 
deletions and insertions of unrelated sequences, from 
those of A. oryzae and A. flavus, in which both loci 
are well conserved. In addition to these differences, 
more than 10 gaps of <100 bp were observed in the 
corresponding loci in A. sojae NBRC4239 (data not 
shown). As described above, all 14 gaps of >100 bp lay 
in non-coding regions in A. sojae. To date, aflatrem 
production by A. sojae has not been reported, but the 
sequence information obtained in the present study did 
not provide evidence that explains aflatrem 
non-productivity. Further analysis will be needed to 
resolve this discrepancy concerning aflatrem production 
in A. sojae. 